
Reduce unnecessary file downloading in archive/file states #43518

Merged
merged 6 commits on Sep 22, 2017

Conversation

terminalmage
Contributor

@terminalmage terminalmage commented Sep 15, 2017

The file.managed state, which is used by the archive.extracted state to download the source archive, at some point recently was modified to clear the file from the minion cache. This caused unnecessary re-downloading on subsequent runs, which slows down states considerably when dealing with larger archives.

This PR makes the following changes to improve this:

  1. The fileclient now accepts a source_hash argument, which will cause the client's get_url function to skip downloading http(s) and ftp files if the file is already cached, and its hash matches the passed hash. This argument has also been added to the cp.get_url and cp.cache_file functions.

  2. We no longer try to download the file when it's an http(s) or ftp URL when running file.source_list.

  3. Where cp.cache_file is used, we pass the source_hash if it is available.

  4. A keep_source argument (originally named cache_source in this PR) has been added to the file.managed state, defaulting to True. This is now used to control whether or not the source file is cleared from the minion cache when the state completes.

  5. Two new states (file.cached and file.not_cached) have been added to manage files in the minion cache.

In addition, the archive.extracted state has been modified in the following ways:

  1. For consistency with file.managed, a keep_source argument (originally named cache_source) has been added. This also deprecates keep. If keep is used, keep_source assumes its value, and a warning is added to the state return to let the user know to update their SLS.

  2. The variable name cached_source (used internally in the archive.extracted state) has been renamed to cached to reduce confusion with the new keep_source argument. (Note: now that cache_source has been renamed to keep_source, I'm not going to go back and rename these variables back to cached_source. They will stay as-is.)

  3. The new file.cached and file.not_cached states are now used to manage the source tarball instead of file.managed. This improves disk usage and reduces unnecessary complexity in the state as we no longer keep a copy of the archive in a separate location within the cachedir. We now only use the copy downloaded using cp.cache_file within the file.cached state. This change has also necessitated a new home for hash files tracked by the source_hash_update argument, in a subdirectory of the minion cachedir called archive_hash.

Resolves #38971.
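Taken together, the new behavior can be sketched in a hypothetical SLS (the URL, hash, and target path below are placeholders for illustration, not values from this PR):

```yaml
# Hypothetical SLS illustrating the new keep_source argument.
# On subsequent runs, the cached archive is reused as long as its
# digest matches source_hash, so nothing is re-downloaded.
extract_app:
  archive.extracted:
    - name: /opt/app
    - source: https://example.com/app-1.0.tar.gz
    - source_hash: sha256=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
    - keep_source: True   # leave the archive in the minion cache after the run
```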

@damon-atkins
Contributor

damon-atkins commented Sep 15, 2017

I have had to use cp to cache a dir and then use cp.is_cached to get the path to a file. Is there a need for a cp.cache_local_path (_extrn_path), so one can get the file name translation from source to local cache?

With regards to caching http and https, can we use If-Modified-Since, or keep a copy of the header and compare key values in it to determine if the file needs to be downloaded again? Maybe a cache/.../extern_meta which contains the http/https headers for extrn_files.

Another thing which might help is keeping the timestamp on the cached file the same as the timestamp on the source, if possible.

I can raise these ideas as an issue if you like.
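The timestamp idea could be sketched like this (an illustration, not Salt code; the Last-Modified value is copied from the putty response shown later in this thread):

```python
import calendar
import os
import tempfile
from email.utils import parsedate

def apply_last_modified(path, last_modified_header):
    """Set a file's mtime/atime to match an HTTP Last-Modified header."""
    # parsedate() returns a UTC time tuple; timegm() converts it to an epoch
    timestamp = calendar.timegm(parsedate(last_modified_header))
    os.utime(path, (timestamp, timestamp))
    return timestamp

# Example with a scratch file and a sample header value
fd, cached = tempfile.mkstemp()
os.close(fd)
mtime = apply_last_modified(cached, 'Tue, 04 Jul 2017 19:36:06 GMT')
os.remove(cached)
```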

@terminalmage
Contributor Author

terminalmage commented Sep 15, 2017

> I have had to use cp to cache a dir and then use cp.is_cached to get the path to a file. Is there a need for cp.cache_local_path so one can get the file name translation from source to local cache?

cp.is_cached will give the cached path if the file is cached, and an empty string if it is not cached. Why is this not sufficient?

> With regards to caching http and https, it seems it needs to use If-Modified-Since, or keep a copy of the header and compare key values in it to determine if the file needs to be downloaded again. Maybe a cache/.../extern_meta which contains the http/https headers for extrn_files.

> Another thing which might help is keeping the timestamp on the cached file the same as the timestamp on the source, if possible.

I honestly have no idea what you're talking about here, sorry.

The source_hash will be used to drive whether or not the file is downloaded. If the locally-cached file matches the configured source_hash, we don't download. Before, we'd download every time, and then compare the hash to the source_hash. If it didn't match, we'd fail the state anyway, so there would be a lot of unnecessary downloading for what would result in a failed state. This approach allows the state to continue to work if the remote file changes, and if you want the new copy of the file, you just update the source_hash.
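The check described here — skip the download when the cached copy's digest matches source_hash — could be sketched like this (a simplified illustration, not the actual fileclient code):

```python
import hashlib
import os
import tempfile

def needs_download(cached_path, source_hash, hash_type='sha256'):
    """Return True if the file must be (re-)downloaded.

    The download is skipped only when a cached copy exists and its
    digest matches the configured source_hash.
    """
    if not os.path.isfile(cached_path):
        return True
    digest = hashlib.new(hash_type)
    with open(cached_path, 'rb') as fh:
        # Hash in chunks so large archives don't need to fit in memory
        for chunk in iter(lambda: fh.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest() != source_hash

# Example: a cached file whose sha256 we know
fd, path = tempfile.mkstemp()
os.write(fd, b'hello')
os.close(fd)
HELLO_SHA256 = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
fresh = needs_download(path, HELLO_SHA256)   # cached copy matches: no download
stale = needs_download(path, 'f' * 64)       # hash mismatch: download again
os.remove(path)
```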

@damon-atkins
Contributor

damon-atkins commented Sep 15, 2017

The following are two examples of http response headers.
From GET https://the.earth.li/~sgtatham/putty/0.70/w64/putty-64bit-0.70-installer.msi

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:11:50 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: application/x-msi

From Inkscape:

HTTP/1.1 200 OK
Server: nginx/1.10.3
Content-Type: application/octet-stream
Last-Modified: Mon, 07 Aug 2017 16:11:49 GMT
ETag: "59889145-4f41ab7"
Expires: Tue, 10 Oct 2017 11:02:31 GMT
Cache-Control: max-age=2592000
Content-Disposition: attachment; filename="inkscape-0.92.2-x64.msi"
Content-Length: 83106487
Accept-Ranges: bytes
Date: Fri, 15 Sep 2017 03:04:31 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-mel6525-MEL
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1505444671.241058,VS0,VE1

For example, if Content-Length, ETag and Last-Modified have not changed, the existing cached file is ok.
Cache-Control could be used to clean out old files.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since
In the putty example, when requesting the file with GET:
If-Modified-Since: Tue, 04 Jul 2017 19:36:06 GMT
The web server responds with 304 Not Modified.

Or using the ETag:
If-None-Match: "2e8600-55382fea0ec86"
The web server responds with 304 Not Modified.

Or send both.

Or use HEAD first, compare, and then GET (fetch) if needed, which might be easier to code.
curl -I https://the.earth.li/~sgtatham/putty/0.70/w64/putty-64bit-0.70-installer.msi

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:22:01 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Content-Type: application/x-msi

By keeping the header response from GET, the minion can determine whether the file in the cache is still valid.

cache/.../extern_meta/...../putty-64bit-0.70-installer.msi would contain the header from the original GET request:

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:11:50 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: application/x-msi

If the HEAD response does not match the stored GET response headers in cache/.../extern_meta/...../putty-64bit-0.70-installer.msi when comparing Last-Modified, ETag, Content-Length, and Content-Type, then fetch the file again instead of using the cached version. The stored headers must contain Last-Modified or ETag (or both) to allow a valid comparison; a response missing Last-Modified, ETag, and/or Content-Length generally means the content is dynamic.
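The HEAD-and-compare approach could be sketched like this (an illustration of the proposal, not code from this PR; the header values are taken from the putty example above):

```python
# Decide whether a cached download is still valid by comparing the
# headers stored at download time against a fresh HEAD response.
VALIDATORS = ('Last-Modified', 'ETag', 'Content-Length', 'Content-Type')

def cache_is_valid(stored_headers, head_headers):
    """Return True if the cached file can be reused without re-downloading."""
    # Without Last-Modified or ETag there is nothing reliable to compare;
    # the content is likely dynamic, so always re-fetch.
    if 'Last-Modified' not in stored_headers and 'ETag' not in stored_headers:
        return False
    return all(
        stored_headers.get(key) == head_headers.get(key)
        for key in VALIDATORS
    )

# Headers saved when the file was first downloaded (putty example)
stored = {
    'Last-Modified': 'Tue, 04 Jul 2017 19:36:06 GMT',
    'ETag': '"2e8600-55382fea0ec86"',
    'Content-Length': '3048960',
    'Content-Type': 'application/x-msi',
}
unchanged = cache_is_valid(stored, dict(stored))               # reuse cache
changed = cache_is_valid(stored, {**stored, 'ETag': '"new"'})  # re-fetch
```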

@ari
Contributor

ari commented Sep 15, 2017

My feedback is that this change is unnecessarily complex and the documentation is very lacking.

  1. Why rename keep to cache_source? These sorts of arbitrary changes to suit the style of a new developer make upgrading salt more painful than necessary.

  2. The concept of file.cached and its interaction with other file states isn't clear to me. Can't we just restore the old behaviour of file states with regard to the keep attribute? That is, don't clear the file from the minion (if necessary by putting it or a hash into a new archive_hash folder -- that's poorly named, by the way, since it isn't specific to archives).

  3. If file.cached is really important and helpful, then it needs much more documentation about how to use it in conjunction with other file states.

@cachedout
Contributor

@terminalmage I'm wondering here about 2017.7 versus develop. Seems like the latter might be a better target but let me know your thoughts.

@saltstack saltstack deleted a comment from KumarRajnish Sep 19, 2017
@terminalmage
Contributor Author

> Why rename keep to cache_source? These sorts of arbitrary changes to suit the style of a new developer make upgrading salt more painful than necessary.

It's not an arbitrary change. The new caching behavior doesn't apply just to the archive.extracted state, it also applies to the file.managed state. An argument named keep doesn't make any sense in file.managed. It's just not descriptive enough, and I think I can safely make that declaration having been a contributor for almost six years, and having been employed full-time by SaltStack for four of those years.

I'm more than willing to have my fellow long-time contributors weigh in on this though, and use keep in both archive.extracted and file.managed. Or, perhaps, I can modify the code such that either keep or cache_source is a valid argument to archive.extracted, with both doing the same thing. @cachedout thoughts?

> The concept file.cached and its interaction with other file states isn't clear to me. Can't we just return the old behaviour of file states with regard to the keep attribute? That is, don't clear the file from the minion (if necessary by putting it or a hash into a new archive_hash folder -- that's poorly named by the way since it isn't specific to archives).

Using a file.managed state to download the archive was never a good idea in the first place. The reason is that it first downloads the file into the cachedir, and then places a copy of it in the final destination. Managing files within the cachedir using file.managed is just not a good idea; that's not what it was designed to do. So, to work around this, the archive.extracted state would create a separate path to use as the final destination for the cached archive file.

The new file.cached state simplifies what we need to do to ensure that the archive file is present in the cachedir by simply ensuring that copy has been cached using cp.cache_file, without the need to devise an alternate destination for the cached archive.

With regard to the archive_hash directory being "poorly named", I'm sorry but you are just incorrect. This is a brand new directory being added just for the archive.extracted state, and it is only used if the source_hash_update argument is set to True. source_hash_update is used to force the archive to be extracted if the file paths within the archive don't change but the hash of the source archive does. In the past, we would place this hash file alongside the separate path we had created for the copy of the archive we had downloaded using file.managed. Since we're not using file.managed anymore for this, we need a new home for these hash files, and placing them in the normal path under the cachedir could result in collisions if we do a cp.cache_file on a URL which would end up being cached to the same path. Thus, the new archive_hash directory which, yes, is specific to archives.

> If file.cached is really important and helpful, then it needs much more documentation about how to use it in conjunction with other file states.

Since it's designed to be used internally within state functions, and not "in conjunction" with them, I don't know how critical adding a bunch of documentation is. I can by all means add a couple of code examples, but this state will rarely be useful in an SLS file, and I have already mentioned in the docstring for file.cached that it is designed for internal use within states.
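A usage sketch of the kind of docstring example being described (the URL and hash are placeholders; per the discussion above, file.cached is intended primarily for internal use by other states):

```yaml
# Ensure an archive is present in the minion cache, verified by hash
cache_the_tarball:
  file.cached:
    - name: https://example.com/app-1.0.tar.gz
    - source_hash: sha256=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

# Later, evict it from the cache
evict_the_tarball:
  file.not_cached:
    - name: https://example.com/app-1.0.tar.gz
```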

@terminalmage
Contributor Author

I've added some code examples to the docstring for the file.cached state. I also noticed an issue with invoking a state function via __states__ instead of via a HighState instance (as we used to do), which I've also corrected.

@cachedout The unnecessary re-downloading started happening because of changes to file.managed in 2017.7.0, so I considered this a regression that needed fixing in the 2017.7 release branch. If you'd rather push this to the next feature release though, I can understand.

@cachedout
Contributor

@terminalmage I suppose it depends on our view of performance regressions. My view here is that changes to the file client are so fundamental to the behavior and stability of Salt that I like to take the more conservative route and defer them to major releases. I'm not necessarily opposed to merging it here but I'd feel more comfortable in the develop branch.

@terminalmage
Contributor Author

@cachedout I don't have an objection to rebasing against develop.

@ari
Contributor

ari commented Sep 20, 2017

@terminalmage

> An argument named keep doesn't make any sense in file.managed. It's just not descriptive enough, and I think I can safely make that declaration having been a contributor for almost six years, and having been employed full-time by SaltStack for four of those years.

Thank you for all your work on this project. It has been something very important to me for the last two years.

I come from the perspective of a user of salt, who struggled at first to learn and understand its concepts, and who lives in fear of every upgrade and what part of my configuration will be broken. I love the concepts in salt, but hate its fluid nature and lack of documentation, particularly examples of best practice.

With that out of the way, I get that you feel 'cache_source' is clearer, but to me it isn't. That's just one person, so make of it what you will. When I read it, I see "cache" as a noun, not a verb, so I expect the value to be a path, not a boolean. "keep" is always a verb. But more importantly, breaking my working states with each upgrade for the sake of a better name isn't very helpful to your users. "ls" may not be the most intuitive way to get a directory listing in Unix, but there is no discussion about changing it now.

Remember that every time you change some naming, you break not only everyone's working system (yes, everyone should read the release notes in detail and understand every implication, but they don't), but you also invalidate all the documentation, forum posts and other information already out there.

Perhaps this is a little like painting the bike shed, but for something as business-critical as a configuration management system, clarity and consistency are so, so important.

Thanks for listening, and thanks for your detailed explanation. I hope that you could put even a fraction of what you explained to me in the official documentation. Your description of archive_hash could go into the docs almost as is. Everyone would be much the wiser.

@terminalmage
Contributor Author

The keep usage still works, albeit with a warning. It doesn't "break everyone's working system".

> I hope that you could put even a fraction of what you explained to me in the official documentation. Your description of archive_hash could go into the docs almost as is. Everyone would be much the wiser.

But why? The choice of where we store the hash files tracked by source_hash_update (i.e. archive_hash, as changed in this PR) is not user-facing. It's entirely internal. What is the benefit of explaining something like this when users will almost certainly never need to interact with it? The docs are a user reference, not a developer reference.

@terminalmage
Contributor Author

I've modified the PR so that cache_source is now keep_source, and I've gotten rid of the warning when keep is used. If both keep_source and keep are used, keep is ignored.

@cachedout I've also rebased onto develop. I think I have a way of incorporating the non-fileclient fixes from this PR against 2017.7, which I can do in a separate PR.
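The precedence rule described here (keep ignored whenever keep_source is also passed) could be sketched like this (an illustration, not the actual state code; the function name and warning text are invented for the example):

```python
import warnings

def resolve_keep_source(keep_source=None, keep=None, default=True):
    """Resolve the effective keep_source value from new and legacy arguments.

    keep_source wins when both are passed; keep alone still works for
    backward compatibility.
    """
    if keep_source is not None:
        if keep is not None:
            warnings.warn("'keep' is ignored when 'keep_source' is also passed")
        return bool(keep_source)
    if keep is not None:
        return bool(keep)
    return default

effective = resolve_keep_source(keep_source=False, keep=True)  # keep is ignored
legacy = resolve_keep_source(keep=False)                       # old argument honored
default = resolve_keep_source()                                # defaults to True
```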

@terminalmage terminalmage removed the ZRELEASED - 2017.7.3 and bugfix-bckport (will be back-ported to an older release branch by creating a PR against that branch) labels Sep 21, 2017
@terminalmage terminalmage force-pushed the issue38971 branch 3 times, most recently from be9bdb6 to 9d749af Compare September 21, 2017 07:32
…e.extracted

The code used to have a salt.state.HighState instance call the state.
That method pre-dated availability of __states__ to use for executing
the state function. The HighState instance handles exception catching
and produces a list as output if there were errors which arose before
the state was executed. Running the state function using __states__ does
not give you any such protection. This commit removes the type check on
the return data, as it will never be a list when run via __states__, and
wraps the state function in a try/except to catch any exceptions that
may be raised by invoking the file.cached state.
This new argument name is ambiguous as "cache" can either be interpreted
as a noun or a verb.
@max-arnold
Contributor

Etag support has been implemented: #61764 #61391
